Final Project: A protein expression analysis of Breast Cancer

Group 23

Introduction

The data set consists of:

  • iTRAQ proteome profiling of 77 breast cancer samples and 3 healthy samples, with expression values for ~12,000 proteins per sample.

  • A file containing the clinical data of the 77 breast cancer patients (TCGA ID, sex, age, tumor receptors, etc.).

  • A file containing the list of genes and proteins used by the PAM50 classification system.

Analysing this data set supports several potential applications in breast cancer: identifying biomarkers through expression analysis, understanding disease heterogeneity, and informing personalized treatment strategies.

Materials and methods

Workflow

Description of relevant variables

  • Dropped variables (redundant or not relevant): Survival.Data.Form, Days.to.date.of.Death, Days.to.Date.of.Last.Contact, OS.Time, Vital.Status, Tumor..T1.Coded, Metastasis.Coded, AJCC.Stage, Converted.Stage, and all columns intended for clustering the data.

  • Created variables:

    • Age.Ini.Diagnostic.group: intervals of 10 years starting from 30 and going up until 90.

    • Age.Menopausal.group: [30, 45) Pre-menopausal, [45, 55) Menopausal, [55, 90) Post-menopausal.

    • ER_PR_HER2: level from 0 to 7 depending on which hormonal receptors (ER, PR) are present and the level of HER2.

    • TNBC: 0 if positive, 1 if negative

    • AJCC.Simp: simplified AJCC stages (I, II, III, and IV)
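As an illustration of how the age groups and the simplified stage could be derived, here is a minimal base R sketch. It runs on a small simulated table, not the project data: the column name `age` and the "Stage IIA"-style format of `AJCC.Stage` are assumptions made for the example, and the ER_PR_HER2 and TNBC encodings are left out because their exact coding is not described here.

```r
# Simulated clinical table; `age` and the stage strings are placeholders (assumptions)
clinical <- data.frame(
  age        = c(34, 45, 54, 55, 72, 89),
  AJCC.Stage = c("Stage I", "Stage IIA", "Stage IIB",
                 "Stage IIIA", "Stage IIIC", "Stage IV")
)

# 10-year intervals [30,40), [40,50), ..., [80,90); right = FALSE closes the left end
clinical$Age.Ini.Diagnostic.group <- cut(clinical$age,
                                         breaks = seq(30, 90, by = 10),
                                         right = FALSE)

# Menopausal groups with the cut points listed above
clinical$Age.Menopausal.group <- cut(clinical$age,
                                     breaks = c(30, 45, 55, 90),
                                     labels = c("Pre-menopausal",
                                                "Menopausal",
                                                "Post-menopausal"),
                                     right = FALSE)

# Keep only the Roman numeral: drop the "Stage " prefix and any sub-stage letter
clinical$AJCC.Simp <- sub("^Stage ([IV]+)[A-C]?$", "\\1", clinical$AJCC.Stage)
```

With half-open intervals, a patient aged exactly 45 falls in the Menopausal group, and ages outside [30, 90) become NA.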

Analysis 1: PCA

Analysis 3: Differential expression

Two functions were created to make it easy to run several comparisons on the data.

DEA_proteins()

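# Requires the tidyverse (dplyr, tidyr, purrr, stringr, ggplot2) and broom
library(tidyverse)
library(broom)

# Fits one linear model per protein (log2_iTRAQ ~ condition) and flags
# differentially expressed proteins. `condition_test` is passed unquoted and
# is expected to be numeric (e.g. coded 0/1), so that the model term carries
# the same name as the column.
DEA_proteins <- function(data_in, condition_test){
  col_name <- deparse(substitute(condition_test))

  # Keep the protein columns (RefSeq IDs starting with NP, XP or YP) plus the
  # condition, then reshape to one row per sample and protein
  data_long <- data_in |>
    dplyr::select(matches("^NP"),
                  matches("^XP"),
                  matches("^YP"),
                  {{ condition_test }}) |>   
    pivot_longer(cols = -{{ condition_test }},
                 names_to = "Protein",
                 values_to = "log2_iTRAQ")
  
  # One nested data frame per protein
  data_long_nested <- data_long |>
    group_by(Protein) |>
    nest() |>
    ungroup()
    
  # Fit the linear model for each protein
  data_w_model <- data_long_nested |>
    group_by(Protein) |>
    mutate(model_object = map(.x = data,
                              .f = ~lm(formula = str_c("log2_iTRAQ ~", col_name) ,
                                       data = .x)))

  # Extract coefficients with 95% confidence intervals
  data_w_model <- data_w_model |>
    mutate(model_object_tidy = map(.x = model_object,
                                   .f = ~tidy(.x,
                                              conf.int = TRUE,
                                              conf.level = 0.95)))
  
  # Keep the slope for the condition, adjust p-values for multiple testing
  # (Holm, the p.adjust() default) and label each protein
  estimates <- data_w_model |>
    unnest(model_object_tidy) |>
    filter(term == col_name) |>
    ungroup() |>
    dplyr::select(Protein, p.value, estimate, conf.low, conf.high) |>
    mutate(q.value = p.adjust(p.value, method = "holm")) |>
    mutate(dif_exp = case_when(q.value <= 0.05 & estimate > 0 ~ "Up",
                               q.value <= 0.05 & estimate < 0 ~ "Down",
                               q.value > 0.05 ~ 'NS'))

  plt_volcano <- volcano_plot(estimates, col_name)
  return(list(estimates=estimates, plt_volcano=plt_volcano))
}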
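The core of the per-protein test can also be sketched in base R on simulated data, which may help when checking the logic without the full data set. Everything below is hypothetical: the protein names `P1` to `P5`, the binary `condition` flag, and the simulated effect size are not taken from the project. The sketch fits one linear model per protein, applies the Holm correction, and labels each protein with the same 0.05 threshold.

```r
set.seed(1)
n <- 40
condition <- rep(c(0, 1), each = n / 2)   # hypothetical 0/1 clinical flag

# Simulated log2 ratios for 5 hypothetical proteins; only P1 truly differs
expr <- sapply(1:5, function(j) {
  shift <- if (j == 1) 3 else 0
  rnorm(n) + shift * condition
})
colnames(expr) <- paste0("P", 1:5)

# One linear model per protein; keep the slope and its p-value
fits <- t(apply(expr, 2, function(y) {
  coef(summary(lm(y ~ condition)))["condition", c("Estimate", "Pr(>|t|)")]
}))

res <- data.frame(Protein  = rownames(fits),
                  estimate = fits[, 1],
                  p.value  = fits[, 2])

# Holm correction across proteins, then Up / Down / NS labels
res$q.value <- p.adjust(res$p.value, method = "holm")
res$dif_exp <- ifelse(res$q.value > 0.05, "NS",
                      ifelse(res$estimate > 0, "Up", "Down"))
```

Holm controls the family-wise error rate and is therefore stricter than a false discovery rate correction such as `method = "BH"`; which one suits the analysis is a design choice.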

volcano_plot()

volcano_plot <- function(data, condition_test){
  plt <- data |>
    group_by(dif_exp) |>
    mutate(label = case_when(dif_exp == "Up" ~  str_c(dif_exp,
                                                       " (Count: ",
                                                       n(),
                                                       ")" ),
                             dif_exp == "Down" ~  str_c(dif_exp,
                                                         " (Count: ",
                                                         n(),
                                                         ")" ),
                             dif_exp == "NS" ~  str_c(dif_exp))) |>
    ggplot(aes(x = estimate,
               y = -log10(p.value),
               colour = label)) +
    geom_point(alpha = 0.4,
               shape = "circle") +
    labs(title = str_c("Differentially expressed proteins in the test: ",
                        condition_test,
                        " vs. Non-",
                        condition_test),
         subtitle = "Proteins highlighted in either red or blue were\nsignificant after multiple test correction",
         x = "Estimates", 
         y = expression(-log[10]~(p)),
         color = "Differential expression") +
    scale_color_manual(values = c("blue",
                                  "grey",
                                  "red")) +
    theme_minimal() +
    theme(legend.position = "right",
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) 
  return(plt)
}
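One point to keep in mind with `scale_color_manual()`: unnamed colours are assigned to the categories by position, so a comparison with no "Down" or no "Up" proteins would shift the colours. A possible refinement, shown here only as a sketch on a simulated results table (the `toy` data and the name `dif_cols` are hypothetical), is to map colour to the plain `dif_exp` category and pass a named colour vector.

```r
library(ggplot2)

# Simulated results table (not project data); note that no protein is "Down"
toy <- data.frame(
  estimate = c(1.2, 0.1, -0.2, 0.9),
  p.value  = c(1e-6, 0.4, 0.7, 1e-4),
  dif_exp  = c("Up", "NS", "NS", "Up")
)

# Named colours are matched to categories by name, not by position
dif_cols <- c(Down = "blue", NS = "grey", Up = "red")

plt <- ggplot(toy, aes(x = estimate,
                       y = -log10(p.value),
                       colour = dif_exp)) +
  geom_point(alpha = 0.4) +
  scale_colour_manual(values = dif_cols) +
  theme_minimal()
```

The per-category counts could then be shown through the `labels` argument of the scale or in the subtitle instead of being part of the mapped variable.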

Analysis 3: Differential expression

NP_002094.2: glycogen [starch] synthase, muscle isoform 1